Project presentation

Ander Barrio Campos (231938), Dionysios Dimitreas (s232752), Erikas Mikužis (s223164), Valeria Tedeschi (s231945), Angeliki Vliora (233059)

Introduction

Tidyverse: enhances data manipulation and visualization with a tidy data workflow, fostering code that is

  • readable
  • maintainable
  • reproducible

Packages used: ggplot2, dplyr, tidyr, readr (core tidyverse) and broom (tidymodels)

Our Dataset

Source: Behavioral Risk Factor Surveillance System (BRFSS) 2015.

Key Features: Health indicators related to diabetes, including:

  • lifestyle factors
  • health outcomes
  • demographic information

Introduction

Research Questions

  1. What are the key predictive variables in diabetes prognosis?

  2. How does gender influence the manifestation and progression of diabetes?

Materials and Methods

Data Cleaning and Augmentation

Data Cleaning

  • Removed Missing Values: df_cleaned <- df |> drop_na()

  • Verified Data Types: column_types <- summarise(df_cleaned, across(everything(), class))

  • Filtered Incorrect Values: Removed rows with values outside the expected ranges.
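
The three cleaning steps can be sketched on a toy table as follows. This is a minimal illustration, not the project's actual script: the toy values, the column selection, and the range bounds (BMI between 10 and 100, Smoker in {0, 1}) are assumptions chosen for the example.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the BRFSS table (values are invented).
df <- data.frame(
  BMI    = c(27, 31, NA, 250, 22),
  Smoker = c(0, 1, 1, 0, 2),
  HighBP = c(1, 0, 1, 1, 0)
)

# 1. Remove rows with any missing value.
df_cleaned <- df |> drop_na()

# 2. Record the class of every column (one row, one entry per column).
column_types <- summarise(df_cleaned, across(everything(), class))

# 3. Keep only rows inside the expected ranges (bounds are assumptions).
df_cleaned <- df_cleaned |>
  filter(between(BMI, 10, 100), Smoker %in% c(0, 1))
```

In this toy table, the row with the missing BMI, the row with BMI 250, and the row with Smoker coded as 2 are all dropped.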

Data Augmentation

  • Transformed Variables: Binary to categorical (e.g., Smoker to Smoking Status).

  • Created New Variables: E.g., Habits, Health Risk, based on lifestyle and health indicators.

  • Socio-Economic Class: Derived from income, education, and healthcare status.
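
A sketch of how such derived variables might be built with dplyr. The rule used here for Habits (counting healthy behaviours and binning the count) is a hypothetical stand-in, as are the PhysActivity and Fruits inputs and the helper column healthy_count; the project's actual definitions may differ.

```r
library(dplyr)

# Toy data; PhysActivity and Fruits are assumed indicator columns.
df_cleaned <- data.frame(
  Smoker       = c(0, 1, 1, 0),
  PhysActivity = c(1, 0, 1, 0),
  Fruits       = c(1, 0, 0, 1)
)

df_aug <- df_cleaned |>
  mutate(
    # Binary indicator turned into a labelled categorical variable.
    Smoking_Status = if_else(Smoker == 1, "Smoker", "Non-smoker"),
    # Hypothetical composite: number of healthy behaviours per person.
    healthy_count = (Smoker == 0) + PhysActivity + Fruits,
    # Bin the count into lifestyle categories (thresholds are assumptions).
    Habits = case_when(
      healthy_count == 3 ~ "Healthy",
      healthy_count == 2 ~ "Average",
      TRUE               ~ "Unhealthy"
    )
  )
```

The same case_when() pattern could combine income, education, and healthcare indicators into a socio-economic class.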

Data Analysis

Correlation

  • Between all variables: health-related variables are correlated with one another. No pair of variables is strongly negatively correlated.

  • With the target variable: GenHlth, HighBP and BMI are the variables most correlated with the diabetes variable.
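
Both views of the correlation can be obtained from a single call to cor(). The sketch below uses simulated data with invented effect sizes, and the target column name Diabetes_binary is an assumption.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; the relationships are invented for illustration.
df_num <- data.frame(
  BMI     = rnorm(n, mean = 28, sd = 5),
  GenHlth = sample(1:5, n, replace = TRUE),
  HighBP  = rbinom(n, 1, 0.4)
)
df_num$Diabetes_binary <- with(
  df_num,
  rbinom(n, 1, plogis(-4 + 0.05 * BMI + 0.5 * GenHlth + 0.8 * HighBP))
)

# Pairwise Pearson correlations between all numerical variables.
cor_matrix <- cor(df_num)

# Correlation of each predictor with the target, strongest first.
cor_with_target <- cor_matrix[, "Diabetes_binary"]
cor_with_target <- cor_with_target[names(cor_with_target) != "Diabetes_binary"]
cor_with_target <- sort(cor_with_target, decreasing = TRUE)
```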

GLM

  • All variables: Fitted a GLM with all numerical variables.

  • Stepwise selection: Forward and backward stepwise selection to choose the best variables.

  • Results: Lowest AIC achieved with the backward model, which retains 19 variables.
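
One way to run this kind of selection is with stats::step(), which adds or drops terms according to AIC. Whether the project used step() itself is an assumption. The data is simulated, and the Noise column is a deliberately uninformative predictor added for the example.

```r
set.seed(42)
n <- 500

# Simulated stand-in data; Noise has no effect on the outcome.
df_num <- data.frame(
  BMI     = rnorm(n, mean = 28, sd = 5),
  GenHlth = sample(1:5, n, replace = TRUE),
  HighBP  = rbinom(n, 1, 0.4),
  Noise   = rnorm(n)
)
df_num$Diabetes_binary <- with(
  df_num,
  rbinom(n, 1, plogis(-5 + 0.06 * BMI + 0.5 * GenHlth + 0.8 * HighBP))
)

# Logistic GLM with every predictor, and an intercept-only baseline.
full_model <- glm(Diabetes_binary ~ ., data = df_num, family = binomial)
null_model <- glm(Diabetes_binary ~ 1, data = df_num, family = binomial)

# Backward: start from the full model and drop terms while AIC improves.
backward_model <- step(full_model, direction = "backward", trace = 0)

# Forward: start from the baseline and add terms up to the full formula.
forward_model <- step(null_model, direction = "forward",
                      scope = formula(full_model), trace = 0)

# Compare the candidates by AIC (lower is better).
aic_table <- AIC(full_model, backward_model, forward_model)
```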

Data Analysis

PCA + Logistic Regression

  • Selected components: 15 components, which together explain 80% of the variance.

  • Logistic regression: Used those components as predictors in a diabetes prediction model.

  • Results: Accuracy of 87%.
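
The PCA-then-regression pipeline can be sketched with prcomp() and glm(). The simulated predictors, the 0.5 classification cut-off, and the in-sample accuracy calculation are all assumptions of this example. A held-out test set would give a more honest estimate.

```r
set.seed(7)
n <- 400

# Simulated predictors and outcome (invented for illustration).
X <- data.frame(
  BMI      = rnorm(n, mean = 28, sd = 5),
  GenHlth  = sample(1:5, n, replace = TRUE),
  HighBP   = rbinom(n, 1, 0.4),
  Age      = sample(1:13, n, replace = TRUE),
  PhysHlth = sample(0:30, n, replace = TRUE)
)
y <- with(X, rbinom(n, 1, plogis(-5 + 0.06 * BMI + 0.5 * GenHlth + 0.8 * HighBP)))

# PCA on centred, scaled predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative variance reaches 80%.
explained <- pca$sdev^2 / sum(pca$sdev^2)
n_comp <- which(cumsum(explained) >= 0.80)[1]

# Logistic regression on the retained component scores.
scores <- as.data.frame(pca$x[, seq_len(n_comp), drop = FALSE])
scores$Diabetes_binary <- y
pca_model <- glm(Diabetes_binary ~ ., data = scores, family = binomial)

# In-sample accuracy at a 0.5 probability cut-off.
pred <- as.integer(predict(pca_model, type = "response") > 0.5)
accuracy <- mean(pred == scores$Diabetes_binary)
```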

New GLMs

  • Men vs. Women: Split the data into two datasets by sex and fitted a GLM to each.

  • Results: The Men model performs better, as it has the lower AIC. General health variables and the fruit variable carry more weight. Performance is much better than that of the GLM from the first part of the analysis.
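
A minimal sketch of fitting one GLM per sex and collecting the AIC of each. The data is simulated, and the Sex coding (0 = women, 1 = men) and the column names are assumptions.

```r
set.seed(3)
n <- 600

# Simulated stand-in data; Sex coding is assumed (0 = women, 1 = men).
df_num <- data.frame(
  Sex     = rbinom(n, 1, 0.5),
  BMI     = rnorm(n, mean = 28, sd = 5),
  GenHlth = sample(1:5, n, replace = TRUE),
  Fruits  = rbinom(n, 1, 0.6)
)
df_num$Diabetes_binary <- with(
  df_num,
  rbinom(n, 1, plogis(-4 + 0.05 * BMI + 0.5 * GenHlth - 0.3 * Fruits))
)

# One dataset per sex.
by_sex <- split(df_num, df_num$Sex)

# Fit the same logistic GLM to each subset; Sex is constant within a
# subset, so it is excluded from the predictors.
models <- lapply(by_sex, function(d) {
  glm(Diabetes_binary ~ . - Sex, data = d, family = binomial)
})

# AIC and sample size of each model, named by the Sex code.
aic_by_sex <- sapply(models, AIC)
n_by_sex   <- sapply(by_sex, nrow)
```

Because AIC depends on the sample each model was fitted to, the two values are best read alongside the group sizes in n_by_sex.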

Results

  • Smoking status affects mostly the youngest age groups
  • Similar behavior for other age groups
  • BMI tends to increase with age

  • Among Healthy individuals, women have healthier habits
  • More non-diabetic women than men have an Average lifestyle; the opposite holds among diabetic individuals.

Results

Analysis part1 results (same as part above)

Results

Analysis part2 results (same as part above)

Discussion

Discussion and key takeaways